Inference tutorial - Part 3 of e2e series [WIP] #2343
Conversation
docs/source/inference.rst (Outdated)

    vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3

    Inference with vLLM
should we move this after Inference with Transformers?
cc @jainapurva I think if vLLM is our recommended serving solution, this should go before transformers.
docs/source/inference.rst (Outdated)

    vLLM automatically leverages torchao's optimized kernels when serving quantized models, providing significant throughput improvements.

    Setting up vLLM with Quantized Models
nit: this doesn't have to be a new section I think
Hi @jainapurva, by the way I'm adding a [image attachment]
Force-pushed from b93b892 to ce675b8
docs/source/inference.rst (Outdated)

    .. note::
       For more information on supported quantization and sparsity configurations, see `HF-Torchao Docs <https://huggingface.co/docs/transformers/main/en/quantization/torchao>`_.

    Inference with vLLM
for this section, can you replace it with https://huggingface.co/pytorch/Qwen3-8B-int4wo-hqq#inference-with-vllm? It might be easier to do command line compared to code.
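For reference, a rough sketch of what the command-line version could look like, reusing the `vllm serve` command from the diff above and the sampling parameters from the curl snippet later in this thread (the prompt text and port are assumptions, not taken from the PR):

```bash
# With the server from `vllm serve pytorch/Phi-4-mini-instruct-float8dq ...` running,
# query its OpenAI-compatible endpoint (default port 8000).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "pytorch/Phi-4-mini-instruct-float8dq",
        "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}],
        "top_p": 0.95,
        "top_k": 20,
        "max_tokens": 32768
      }'
```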
docs/source/serving.rst (Outdated)

    print(f"Output: {generated_text!r}")
    print("-" * 60)

    [Optional] Inference with Transformers
We should have an Inference w/ SGLang section
I tested the integration of TorchAO and SGLang, came across a lot of issues in running the server. As discussed with @jerryzh168 offline, we can add this later, after more thorough testing and updates.
Should we at least add an SGLang section and say (Coming soon!) or something? It's in the diagram at the top right now, so people may search for it.
Looks great! Overall I feel we should add some more text in between code blocks so it feels more like a tutorial, and remove some duplicate code, which is distracting to readers.
    quantized_model.push_to_hub(save_to, safe_serialization=False)
    tokenizer.push_to_hub(save_to)

    # Manual Testing
I would split this into a separate code block and add some text in between, since everything below this line is technically not part of the user flow
    output_text = tokenizer.batch_decode(
        generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    print("Response:", output_text[0][len(prompt):])
Can you also add an example of what is printed here?
    model_id = "microsoft/Phi-4-mini-instruct"

    from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow
nit: move this to the top like the other imports?
    pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
    pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126

    .. code-block:: bash
I think we need to add some text here. E.g. we need to explain we're serving with the quantized checkpoint we pushed to HF hub above. Also would be good to clarify that "float8dq" stands for "dynamic quant"
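As one possible way to address this (a sketch of wording, not the final text): the tutorial could say that we are serving the float8 dynamic-quantized checkpoint pushed to the HF hub in the earlier step, and that the "float8dq" suffix stands for float8 dynamic quantization, e.g.:

```bash
# Serve the checkpoint we pushed to the HF hub above.
# "float8dq" in the repo name means float8 dynamic quantization (weights stored in
# float8, activations dynamically quantized to float8 at runtime); the tokenizer is
# reused from the original microsoft/Phi-4-mini-instruct repo.
vllm serve pytorch/Phi-4-mini-instruct-float8dq \
    --tokenizer microsoft/Phi-4-mini-instruct \
    -O3
```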
"top_p": 0.95, | ||
"top_k": 20, | ||
"max_tokens": 32768 | ||
}' |
I think we should also quickly summarize that serving the float8 model with vLLM is X times faster than serving the original high-precision model, e.g. from the model card (https://huggingface.co/pytorch/Phi-4-mini-instruct-float8dq#quantization-recipe): "Serve using vLLM with 36% VRAM reduction, 1.15x-1.2x speedup and little to no accuracy impact on H100."
    from transformers import (
        AutoModelForCausalLM,
        AutoProcessor,
        AutoTokenizer,
nit: formatting is off, here and other code blocks below
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    print(untied_model)
    from transformers.modeling_utils import find_tied_parameters
nit: move import to top
    Step 1: Untie Embedding Weights
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    We want to quantize the embedding and lm_head differently. Since those layers are tied, we first need to untie the model:
Is this step necessary actually? I don't think I had to do any of this for Llama models for example. Can you share the source for this?
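To help answer this, a quick check could be added to the tutorial (a sketch; whether untying is needed depends on whether the base model ties lm_head to the embedding, and `find_tied_parameters` is already imported in the diff above):

```python
from transformers import AutoModelForCausalLM
from transformers.modeling_utils import find_tied_parameters

# Load the base (unquantized) model; model id taken from the earlier diff.
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-mini-instruct")

# If the config ties word embeddings and find_tied_parameters reports a group
# containing the embedding and lm_head weights, they share storage and must be
# untied before quantizing them with different configs. Models that report no
# tied parameters (e.g. some Llama checkpoints) can skip the untie step.
print(model.config.tie_word_embeddings)
print(find_tied_parameters(model))
```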
    torch.cuda.reset_peak_memory_stats()

    prompt = "Hey, are you conscious? Can you talk to me?"
I see this code duplicated 3 times. Can we just have it appear in 1 place? Having it under "Evaluation" makes sense to me
    Memory Benchmarking
    ^^^^^^^^^^^^^^^^^
Should add some text here. "For the Phi-4-mini-instruct model, serving with float8 dynamic quantization used X% less memory" or something
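For reference, a minimal sketch of how such a number could be measured, building on the `torch.cuda.reset_peak_memory_stats()` call already in the diff (the `model`/`inputs` names and the generate call are placeholders, not from the PR):

```python
import torch

# Reset the peak-memory counter right before the region we want to measure.
torch.cuda.reset_peak_memory_stats()

# Placeholder for the generation step being benchmarked, e.g.:
# outputs = model.generate(**inputs, max_new_tokens=128)

# Peak GPU memory allocated during generation, in GB; run once for the
# baseline bf16 model and once for the float8dq model to compute the saving.
peak_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"Peak memory: {peak_gb:.2f} GB")
```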